A multi-omics dataset of human transcriptome and proteome stable reference

The development of high-throughput omics technologies has greatly advanced biomedicine. However, the poor reproducibility of omics techniques limits their application. Standard reference materials of complex RNAs or proteins are needed to test and calibrate the accuracy and reproducibility of omics workflows. The transcriptome and proteome of most cell lines shift during culturing, which limits their applicability as standard samples. In this study, we demonstrate that the human hepatocellular carcinoma cell line MHCC97H has a very stable transcriptome (r = 0.983~0.997) and proteome (r = 0.966~0.988 for data-dependent acquisition, r = 0.970~0.994 for data-independent acquisition) after 9 subculturing generations, which allows this stable standard sample to be produced consistently on an industrial scale in the long term. Moreover, this stability was maintained across labs and platforms. In sum, our study provides an omics standard reference material and reference datasets for transcriptomic and proteomic research. This helps to further standardize the workflow and data quality of omics techniques and thus promotes the application of omics technology in precision medicine.


Background & Summary
The booming applications of omics technologies provide unprecedented insights into biology and medicine. However, the reproducibility of omics technologies has long been questioned. A genomic sequencing study reported zero sensitivity in finding pathogenic mutations using whole-exome sequencing of 57 patients 1 . The mutations of 40 circulating tumor DNA (ctDNA) samples, sequenced by two independent companies, showed only 12% congruence 2 . In the field of RNA-seq, the wide variety of methodologies limits reproducibility 3 . In the field of proteomics, a Human Proteome Organization (HUPO) study tested 20 highly purified recombinant human proteins, each containing one or more unique tryptic peptides of 1250 Da. The samples were distributed to 27 laboratories for identification. Only 7 of the 27 laboratories reported all 20 proteins, and only 1 laboratory reported all of the 1250 Da tryptic peptides 4 . Another study found that, even under optimal conditions and with a uniform standard operating procedure, the median protein repeatability for the same mixture sample across laboratories was only 75% 5 . In addition, the repeatability of multiple quantitative tests on the same sample, both within the same laboratory and between different laboratories, is less than 80% 6 . Other studies show that the lack of reliability and repeatability in omics research is one of the biggest obstacles to narrowing the gap between personalized medicine research and practice 7,8 .

www.nature.com/scientificdata

Methods
Detection of mycoplasma contamination. We used the Mycoplasma Detection Kit (ExCellBio, MB000-1591, China) to detect mycoplasma contamination of the cells. The detailed steps were as follows: 1~1.5 mL of cell culture supernatant was transferred to a centrifuge tube and centrifuged at 13000 rpm for 5 min. The supernatant was then discarded and the pellet was washed once with PBS.
100 µL of lysis buffer was added to lyse the cells, and the tube was inverted to mix well and left at room temperature for 5 min. The cell lysate was then incubated at 95 °C for 5 min and centrifuged at 13000 rpm for 5 min, and the supernatant was transferred to a new centrifuge tube. PCR was carried out using 1~2 µL of the supernatant as template, and the amplified products were analyzed by electrophoresis on a 2% agarose gel.

RNA extraction. Each generation of cells was cultured to 80~90% confluency and washed twice with RNase-free PBS (LEAGENE, IH0142, China), and total RNA was then isolated using TRIzol RNA extraction reagent (Invitrogen, 15596026, USA). The detailed steps were as follows: cells were collected in 1.5 mL EP tubes and centrifuged at 230 × g for 3 min, the supernatant was discarded, and PBS was added to wash the cells. In a fume hood, 1 mL of TRIzol was added, mixed well, and left at room temperature for 5 min. Then 200 µL of chloroform was added, and the mixture was vortexed vigorously for 15 s and left at room temperature for 3 min. After centrifugation at 12000 × g at 4 °C for 15 min, the sample separated into three layers (with RNA in the upper aqueous phase), and the upper aqueous phase was transferred to a new EP tube. Then 800 µL of isopropyl alcohol was added, and the tube was inverted to mix and kept at -20 °C overnight. The next day, the sample was centrifuged at 12000 × g at 4 °C for 30 min and the supernatant was removed. The RNA pellet was washed with 1 mL of 75% ethanol (precooled at -20 °C) and centrifuged at 7500 × g at 4 °C for 5 min, and the supernatant was removed; this wash step was repeated once. The pellet was redissolved in 20 µL of RNase-free water, and after the RNA concentration was measured, the RNA was examined by electrophoresis on a 1% agarose gel.

mRNA-seq and RNC-seq.
For all mRNA-seq and RNC-seq experiments, DNase I (Thermo Fisher Scientific, EN0525, USA) treatment was performed prior to RNA library construction to remove DNA contamination, according to the manufacturer's instructions. For mRNA-seq, we used two methods to construct the sequencing libraries: the Oligo-dT method (polyA + mRNA strategy) and the rRNA depletion method (Ribominus strategy).
We used VAHTS mRNA Capture Beads (Vazyme, N401-02, China) to purify polyA + mRNA from total RNA. Then, depending on the experimental design, the sequencing libraries for the Oligo-dT method were constructed with the MGIEasy RNA Library Prep Kit (MGI, A0210, China) for the MGI platform or the VAHTS Universal V6 RNA-seq Library Prep Kit (Vazyme, NR604-01, China) for the Illumina platform, according to each manufacturer's instructions. Before the rRNA depletion sequencing libraries were constructed, rRNA was removed from total RNA by probe hybridization followed by RNase H degradation, as we previously reported 35,36 . The rRNA depletion sequencing libraries were also constructed with the MGIEasy RNA Library Prep Kit according to the manufacturer's instructions.
For RNC-seq, only the Oligo-dT method was used for library construction, following the same procedure as for mRNA-seq. Among all the mRNA-seq and RNC-seq data in this project, only the mRNA library constructed by Sagene Co. Ltd. was sequenced on a NovaSeq-6000 platform (Illumina, China) for 300 cycles; the rest were sequenced on a MGISEQ-2000 platform (MGI, China) for 210 cycles.
The high-quality reads were subjected to the subsequent bioinformatics analysis. The adapter sequences were trimmed from the reads. Then reads were mapped to transcripts using the hyper-accurate mapping algorithm FANSe3 37,38 in the next-generation sequencing analysis platform "Chi-Cloud" (http://www.chi-biotech.com). Gene expression levels were quantified using the RPKM (reads per kilobase per million reads) method 39 . Genes with at least 10 reads were considered quantifiable genes 40 .
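The quantification rule described above (RPKM, with genes of at least 10 reads treated as quantifiable) can be illustrated with a minimal sketch. This is not the FANSe3 or Chi-Cloud implementation; the function name `rpkm`, the toy input, and the choice of the summed mapped reads as library size are assumptions made only for illustration.

```python
def rpkm(read_counts, transcript_lengths_bp, min_reads=10):
    """Return one RPKM value per gene, or None for genes below min_reads.

    Assumption: the library size is taken as the sum of the mapped read
    counts passed in; the original pipeline may define it differently.
    """
    total_mapped = sum(read_counts)
    values = []
    for reads, length_bp in zip(read_counts, transcript_lengths_bp):
        if reads < min_reads:
            # Fewer than min_reads reads: treated as not quantifiable.
            values.append(None)
        else:
            # reads / (length in kb) / (mapped reads in millions)
            values.append(reads * 1e9 / (length_bp * total_mapped))
    return values


# Hypothetical toy input: three genes, 1000 mapped reads in total.
example = rpkm([100, 5, 895], [1000, 1000, 2000])
print(example)
```

Returning `None` for low-count genes, rather than a small number, keeps non-quantifiable genes from silently entering downstream correlation analyses.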

RNA degradation.
We tested the RNA samples under degradation conditions to investigate how degradation affects their applicability as a reference, and to test whether our procedure could tolerate the degradation and provide results comparable to those of the non-degraded counterparts.
For RNA samples that were slightly degraded during RNA extraction, we used the Oligo-dT method and the rRNA depletion method for library construction. The probe sequences used in the rRNA depletion method are listed in supplementary Table 1. For RNA samples degraded with RNase A, the libraries were constructed as follows. We randomly selected an intact, non-degraded total RNA sample from generations 1~9 of MHCC97H (Fig. 3a). For example, in the third generation, the concentration of extracted total RNA was 1299.16 ng/µL, the total volume was 20 µL, and 29 µg of total RNA was finally obtained. We performed a series of RNase A degradation experiments, each using 2 µg of total RNA from the third-generation cells as starting material. 1 ng of RNase A was added to each of the five experimental groups (but not to the non-degraded group), and the reactions were run for 30 s, 1 min, 2 min, 5 min, and 1 h, respectively. Finally, 0.5 U of RNaseOUT (Thermo Fisher Scientific, 10777019, USA) was added to terminate the reaction. All experiments were performed at room temperature. The libraries were constructed by the rRNA depletion method and evaluated in the same way as described above (paragraph "mRNA-seq and RNC-seq").
Protein trypsin digestion. Each generation of cells was cultured to 80~90% confluency, then treated with 0.25% trypsin-EDTA (Gibco, 25200056, USA), centrifuged at 300 × g for 5 min at room temperature, and washed twice with PBS, with the supernatant removed by centrifugation. Cells were lysed in 1% SDS lysis buffer (Beyotime, P0013G, China), and the protein concentration was measured with a BCA quantification kit (Thermo Fisher Scientific, 23227, USA). Protein digestion was performed by the filter-aided sample preparation (FASP) method 41 . In brief, protein samples were treated with 8 M urea (8 M urea in 0.1 M Tris-HCl, pH 8.5), resulting in a final urea concentration of ≥4 M. Next, DTT was added to a concentration of 50 mM and the samples were incubated at 37 °C for 1 h. Iodoacetamide (IAA) solution (Merck, I6125, USA) was added to a concentration of 120~150 mM, and the samples were incubated at room temperature for 30 min in the dark. Each solution was transferred into a 10 kDa ultrafiltration tube (Merck, UFC501096, USA) and centrifuged at 12000 × g for 15 min. The filter tube was washed twice with 8 M urea (200 µL each time) and then three times (200 µL each time) with 50 mM triethylammonium bicarbonate (TEAB) (Thermo Fisher Scientific, 90114, USA). Finally, trypsin (Promega, V5280, USA) was added at a ratio of 1:40 (trypsin:protein), and the samples were incubated at 37 °C overnight. After 16 hours, all peptides were collected by centrifugation at 12000 × g for 20 min. The filter tubes were then washed twice with 50 mM TEAB (200 µL each time), and all eluted peptides were collected and pooled. Peptide concentrations were determined using the Pierce Quantitative Fluorometric Peptide Assay kit (Thermo Fisher Scientific, 23290, USA). Finally, all peptides were lyophilized and stored at -80 °C.

Data-dependent acquisition (DDA) mass spectrometry. For data-dependent acquisition analysis, data were collected on a Q Exactive Plus (QE+) mass spectrometer equipped with an EASY-nLC 1000 system (Thermo Fisher Scientific, USA) and on an Orbitrap Fusion Lumos mass spectrometer equipped with an EASY-nLC 1200 system (Thermo Fisher Scientific, USA).
QE+ parameter setting. Each injection consisted of 2 µg of peptides and 1 µL of standard peptides (iRT peptides) (Biognosys, Ki-3002-2, Switzerland). The samples were separated on a 100 µm × 20 mm, 5 µm C18 nano trap column (Thermo Fisher Scientific, AAA-164564, USA) and a 75 µm × 250 mm, 2 µm C18 analytical column (Thermo Fisher Scientific, 164941, USA). On the analytical column, the samples were eluted at a flow rate of 300 nL/min for 120 min, and the elution gradient was:

Data-independent acquisition (DIA) mass spectrometry. For data-independent acquisition analysis, the mass spectrometry data were collected on the QE+ and the Orbitrap Fusion Lumos.
QE+ parameter setting. Each injection consisted of 2 µg of peptides and 1 µL of iRT peptides. Samples were analyzed in data-independent acquisition mode. The liquid chromatography conditions were the same as in the data-dependent acquisition method described above. The mass spectrometer parameters were set as follows: MS1 resolution: 70000; MS2 resolution: 17500; m/z range: 400 to 1200 m/z; variable acquisition windows: 30; AGC target: 3e6; injection time: 60 ms; NCE: 27%; AGC target: 1e6; max injection time: auto.
The common search parameters were: Type: standard, multiplicity 1; Digestion: digestion mode (specific), enzyme (trypsin/P); Variable modifications: oxidation (M), acetyl (protein N-term); Max number of modifications per peptide: 5; Missed cleavage sites allowed: 2; Label-free quantification: LFQ; LFQ minimum ratio count: 2; Fast LFQ was selected; LFQ minimum number of neighbors: 3; LFQ average number of neighbors: 6; Instrument: orbitrap; Fixed modification: carbamidomethyl (C). We adopted the criteria for confident identification with a false discovery rate (FDR) < 0.01 at the peptide and protein levels.

RNA and protein quantification. For mRNA-seq and full-length translating mRNA-seq (RNC-seq) data, reads were mapped with the FANSe3 algorithm against the human transcriptome reference database, and mRNA abundance was normalized using RPKM. For protein quantification, label-free mass spectrometry data were quantified with the iBAQ (intensity-based absolute quantification) algorithm as provided in MaxQuant. Missing values were removed from the protein quantification data before median normalization was performed.
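As an illustration of the last step, the following minimal sketch removes proteins with missing values and then applies median normalization to iBAQ intensities. The text does not specify the exact procedure, so two choices here are assumptions: a protein is dropped if it is missing (None or non-positive) in any sample, and each sample is scaled so that its median matches the median of all sample medians. The function name `median_normalize` and the toy data are hypothetical.

```python
from statistics import median


def median_normalize(ibaq):
    """ibaq maps protein ID -> list of intensities, one entry per sample.

    Assumption: None or non-positive values mark missing quantification,
    and proteins with any missing value are removed before normalization.
    """
    complete = {
        protein: values
        for protein, values in ibaq.items()
        if all(v is not None and v > 0 for v in values)
    }
    if not complete:
        return {}
    n_samples = len(next(iter(complete.values())))
    # Median intensity of each sample over the retained proteins.
    sample_medians = [
        median(values[i] for values in complete.values())
        for i in range(n_samples)
    ]
    # Assumption: every sample is scaled to the median of sample medians.
    target = median(sample_medians)
    factors = [target / m for m in sample_medians]
    return {
        protein: [v * f for v, f in zip(values, factors)]
        for protein, values in complete.items()
    }


# Hypothetical toy data: four proteins measured in two samples.
toy = {
    "P1": [100.0, 200.0],
    "P2": [300.0, 600.0],
    "P3": [500.0, 1000.0],
    "P4": [50.0, None],  # missing in sample 2, so removed
}
normalized = median_normalize(toy)
print(normalized)
```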

Data Records
All the sequencing datasets are available at the NCBI Gene Expression Omnibus (GEO) with dataset identifier GSE234201 42 . All the mass spectrometry raw data are publicly available on iProX with the accession number PXD041292 43 . Details of all omics data are shown in supplementary Tables 2, 6.

Technical Validation
Quality control of cells. To find a cell line with a stable transcriptome and proteome during long-term subculture, we tested 5 commonly used cell lines: MHCC97H, HCCLM6, HCCLM3, HeLa, and A549. To ensure the quality of the cell lines, we tested for mycoplasma at intervals and confirmed that all cell lines were mycoplasma negative (Fig. 2a~d).

RNA quality control.
For each cell line, we cultured 8~12 generations and took samples from each generation. Total RNA was extracted from each sample, and the RNA quality was examined by electrophoresis to verify that the samples were not degraded (Fig. 3a~e).

Omics data quality control. We conducted quality control of the sequencing data (including RNA-seq and RNC-seq) and generated a series of QC metrics. The overall quality of the sequencing datasets was satisfactory at the level of raw and mapped data in the following respects: (1) the average read count of the raw sequencing data was more than 20 M; (2) the average mapping ratio of all samples was around 72%; (3) the average GC content of the data generated from all samples was around 52%; (4) the average rRNA contamination ratio for all samples was around 4.57%; (5) the average Q30 of all samples was around 89%. The detailed results of data quality control are shown in supplementary Table 7.
Reproducibility of transcriptome datasets. We used the polyA + mRNA method to construct transcriptome libraries for sequencing and used the RPKM method for quantification. The mutual correlation of gene expression showed that MHCC97H has the most stable transcriptome, with Pearson r = 0.983~0.997 (Fig. 4a). The other two hepatocellular carcinoma cell lines showed lower consistency (r can be as low as 0.973 and 0.960, respectively, Fig. 4b,c). The HeLa and A549 cells showed even lower consistency over the generations (average r = 0.979 and 0.964, respectively, with the lowest values being 0.948 and 0.920, respectively, Fig. 4d~f).
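The mutual correlation between generations can be sketched as a pairwise Pearson calculation over the genes quantified in every sample. This is only an illustration: the text does not state whether correlations were computed on raw or log-transformed RPKM, so the log10 transform below is an assumption (switchable via `log_transform`), and the function names and toy values are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_r_range(samples, log_transform=True):
    """Return (min r, max r) over all sample pairs.

    samples maps a sample name to RPKM values in a shared gene order,
    with None for genes that were not quantifiable in that sample.
    Assumption: only genes quantified in every sample are compared.
    """
    names = sorted(samples)
    n_genes = len(samples[names[0]])
    keep = [
        i for i in range(n_genes)
        if all(samples[name][i] is not None for name in names)
    ]

    def prepare(values):
        if log_transform:
            return [math.log10(values[i]) for i in keep]
        return [values[i] for i in keep]

    rs = [
        pearson_r(prepare(samples[a]), prepare(samples[b]))
        for a, b in combinations(names, 2)
    ]
    return min(rs), max(rs)


# Hypothetical toy data: three generations, four genes.
toy = {
    "G1": [1.0, 10.0, 100.0, None],
    "G2": [2.0, 20.0, 200.0, 5.0],
    "G3": [1.0, 10.0, 1000.0, 7.0],
}
lowest, highest = pairwise_r_range(toy)
print(lowest, highest)
```

Reporting the minimum and maximum of the pairwise values mirrors the "r = a~b" ranges quoted in the text.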
Subsequently, we performed a series of rigorous evaluations of the stability of MHCC97H at the transcriptome level. Firstly, we subcultured two batches of MHCC97H cell lines in May and December 2021, respectively. The mutual correlation of gene expression over generations was similar (r = 0.977~0.997, Fig. 5a). The correlation of the same generation between the two batches was steadily high (r = 0.998 ± 0.001, Fig. 5b). Secondly, we tested the robustness across experimenters and labs. The results (Fig. 5c~d) were almost identical to those of the former experimenter (Fig. 4a). We also sent 4 samples to two commercial sequencing service providers more than 1000 km away. Chi-Biotech Co. Ltd. was equipped with a MGISEQ-2000 platform, and Sagene Co. Ltd. was equipped with a NovaSeq-6000 platform. The Pearson r reached 0.979~0.991 and 0.970~0.991, respectively (Fig. 5e), and the mutual correlation of gene expression over generations was similar in both labs (Fig. 5f).
We then tested different mRNA enrichment strategies. Our standard protocol used oligo-dT to enrich polyA + mRNA (mature mRNA), as applied in most studies. Another strategy was the rRNA depletion method, which removes rRNA by probe hybridization followed by bead extraction or RNase H degradation. Using the rRNA depletion method, the Pearson r = 0.962~0.986 (Fig. 5g), which was slightly lower than with the oligo-dT method. As expected, the correlation of gene expression between the two strategies was considerably lower (r was only 0.864~0.896) (Fig. 5h), suggesting that data generated by the two different mRNA enrichment strategies should not be mixed for analysis.
RNA is vulnerable to ubiquitous RNases and environmental changes (e.g. freeze-thaw cycles). Therefore, minor or major degradation might be inevitable during the production, storage, and transport of the standard samples. We generated transcriptome sequencing datasets of RNA-degraded samples to investigate how degradation affects their applicability as a reference. Firstly, we created a scenario that mimics degradation due to environmental exposure: the RNA samples were exposed to the air at room temperature for a prolonged time (more than 2 hours), so that RNases in the environment could enter the tube and degrade the RNA. Then, the samples were frozen and thawed for 10 cycles. Agarose gel electrophoresis showed that the total RNA of MHCC97H was degraded to various extents (Fig. 6a). Surprisingly, with the oligo-dT method, the non-degraded RNA samples and their corresponding degraded samples still had a high transcriptome correlation (Pearson r = 0.980~0.993, Fig. 6b). The rRNA depletion method also showed remarkable consistency between the non-degraded and degraded counterparts (r = 0.965~0.983, Fig. 6c), although still lower than with the oligo-dT method.
Most environmental RNases are exonucleases, which may leave the 3'-end of mRNAs intact. Endonucleases, however, may degrade mRNA into smaller fragments. We added RNase A to the non-degraded MHCC97H total RNA and incubated it for 30 seconds to 1 hour to create a series of endonuclease-degraded samples (Fig. 6d). The endonuclease-degraded samples showed considerably lower consistency with their non-degraded counterparts (r = 0.872~0.941, Fig. 6e), but this was still much higher than the correlation reported in other literature (r² = 0.41~0.69) 13 .

Reproducibility of translatomic and proteomic datasets.
It is generally known that translational regulation is the most significant regulatory level 22 . Therefore, we first tested the stability of the MHCC97H translatome over subculture generations. The RNC-seq of MHCC97H showed very high mutual consistency (Pearson r = 0.974~0.996, Fig. 7a). Mass spectrometry requires more steps than RNA-seq, making it difficult to achieve high reproducibility. Nevertheless, the protein abundance detected by mass spectrometry was comparable both in data-dependent acquisition mode (r = 0.966~0.988) (Fig. 7b) and in data-independent acquisition mode (r = 0.970~0.994) (Fig. 7c). To test the variability contributed by the experimental procedures, we started from the same trypsin-digested sample and made 3 independent mass spectrometry measurements (including LC-MS and data analysis). Such technical replicates yielded r² = 0.945~0.949 and r² = 0.975~0.990 in data-dependent acquisition (Fig. 7d) and data-independent acquisition (Fig. 7e) modes, respectively. These results indicated that the variance contributed by biological nature and trypsin digestion could be neglected in the DDA mode, and was barely distinguishable in the DIA mode.
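Because the technical replicates are reported as r² while the generation-to-generation comparisons are reported as r, it can help to put both on the same scale before comparing them. The short sketch below takes the square root of the reported r² ranges, assuming the underlying correlations are positive; the variable names are illustrative only.

```python
import math


def r_from_r2(r2):
    """Magnitude of Pearson r for a given r²; assumes r is positive."""
    return math.sqrt(r2)


# Ranges quoted in the text for technical replicates (r²) and for
# comparisons between subculture generations (r).
technical_r2 = {"DDA": (0.945, 0.949), "DIA": (0.975, 0.990)}
between_generation_r = {"DDA": (0.966, 0.988), "DIA": (0.970, 0.994)}

for mode, (low, high) in technical_r2.items():
    technical_r = (round(r_from_r2(low), 3), round(r_from_r2(high), 3))
    print(mode, "technical r:", technical_r,
          "between-generation r:", between_generation_r[mode])
```

On this common scale, the technical-replicate correlations fall in roughly the same range as the between-generation correlations, which is the basis for the statement that the additional variance from biology and digestion is small.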
Next, we tested the robustness of the standard proteome sample across labs and instruments. We distributed the same batch of standard proteome samples to 4 labs (JNU, SCUT, DICP, and BNU), which were over 2000 km away (Fig. 8a). The samples were shipped in ice boxes at 0 °C for 3 days. All labs followed the same protocol to process the samples. The only hardware differences are listed in Fig. 8b. A similar number of proteins was identified in the 4 labs (Fig. 8c~d). The JNU lab yielded slightly more proteins owing to its longer column, which provides higher chromatographic resolution. The SCUT lab yielded fewer proteins owing to the lower resolution and slower scanning speed of its mass spectrometer. However, the distribution of the isoelectric points (pI) of the identified proteins showed no significant differences among these labs (Fig. 8e). The protein abundance measured by these labs was highly comparable (r = 0.962~0.974, Fig. 8f left). In data-independent acquisition mode, the labs with the same instruments showed highly similar results (r = 0.962), while the SCUT lab, which was equipped with another model of mass spectrometer, showed remarkably lower consistency with the other two labs (r = 0.912) (Fig. 8f right), demonstrating that instrument-specific bias and the spectrum analysis software cannot be neglected.

Code availability
The data analysis methods, software, and associated parameters used in the present study are described in the Methods section. Where no detailed parameters are described for the software used in this study, default parameters were employed. No custom scripts were generated in this work.